known sequence contexts for the purpose of SNP 
screening. The reaction is designed as a premix that 
contains all of the components except primers and 
templates. The completed reaction identifies one 
base located 3' relative to the primer site. We have 
reformulated our SNaPshot reagent mix to enable robust 
multiplex SNP interrogation against multiple templates in 
varying amounts. The resulting multiple products can 
then be analyzed by electrophoresis in the presence of a 
size standard labeled with a 5th dye. Evaluations on ABI 
PRISM® Models 310, 377, 3100 and 3700 have been 
successful. Topics including throughput, reaction format, 
primer design and template requirements will be covered. 

5-Dye System Compatibility across ABI 
PRISM® instrument Platforms 

A. Wheaton, D. Wei, C. Holt, S. Menchen, P. Kenny, B. 
Rosenblum, P. Hanachi, G. Mitra, G. Ayanglou, P. Dong, 
PE Biosystems, Foster City, CA 

We describe the implementation of our new GeneScan™ 
5-dye system across all ABI PRISM® instrument 
platforms. In the 5-dye system, the current D dye set 
(6FAM™, HEX, NED™, and ROX™) becomes the G5 
set for higher throughput in major GeneScan applications. 
G5 includes 6FAM™, VIC™, NED™, PET™ and the 
new 5th-dye labeled size standard. Likewise, the current 
E dye set (dR110, dR6G, dTAMRA™, and dROX™) 
becomes E5 with the addition of the 5th-dye labeled size 
standard to facilitate automated data analysis. The 5-dye 
system incorporates 5-dye data collection, 
[illegible] methods, and enhanced analysis tools to enable a 
variety of new 5-dye applications. Benefits of the 5-dye 
system include optimal spectral resolution, [illegible] 
intensity, and increased throughput across all ABI 
PRISM® instrument platforms. We [illegible] the 
performance of this new five-dye system across the ABI 
PRISM® 310 DNA Sequencer, the ABI PRISM® 377, 
3700, and newly introduced 3100 DNA Analyzers. 

P-179 

The Comprehensive Microbial Resource 

Owen White, Jeremy Peterson, Jonathan A Eisen and 
Steven L. Salzberg, The Institute for Genomic Research, 
Rockville, MD 



One of the challenges presented by large-scale genome 
sequencing efforts is the effective display of information 
in a format that is accessible to the laboratory scientist. 
Conventional databases offer the scientist the means to 
search for a particular gene, sequence, or organism, but 
do little in the way of displaying the vast amount of 
curated information that is becoming available. TIGR 
has developed methods to effectively "slice" the vast 
amounts of data in the sequencing databases in a wide 
variety of ways, allowing the user to formulate queries 
that search for specific genes as well as to investigate 
broader topics, such as genes that might serve as vaccine 
and drug targets. The Omniome database contains all of 
the fully sequenced microbial genomes, the curation from 
the original sequencing centers, and further curation from 
TIGR (for those genomes sequenced outside TIGR). The 
web presentation of the Omniome includes the 
comprehensive collection of bacterial genome sequences, 
curated information, and related informatics 
methodologies. The scientist can view genes within a 
genome and can also link across to related genes in other 
genomes. The effect is to be able to construct queries that 
include sequence searches, isoelectric point, GC-content, 
GC-skew, functional role assignments, growth conditions, 
environment and other questions, and isolate the genes of 
interest. The database contains extensive curated data as 
well as pre-run homology searches to facilitate data 
mining. The interface allows the display of the results in 
numerous formats that will help the user ask more 
accurate questions. This resource should be of value to 
the scientific community to design experiments and spur 
further research. Resources of this type are an essential 
tool to make sense of bacterial genome information as the 
number of completed genomes continues to grow. 

How Deep Is Deep Enough: Criteria for 
planning EST Sequencing Projects 

Joseph A. White, Catherine M. Ronning, and John 
Quackenbush, The Institute for Genomic Research, 
Rockville, MD 

To evaluate the yield of unique sequences obtained from 
cDNA libraries, sequences were selected at random from 
known libraries of ESTs. After assembling the sequences 
into contigs, the numbers of contigs and singletons were 
counted. We observed the following: 1) the percentages 
of unique sequences and singleton sequences always 
decline as sample size is increased, 2) the number of 
contigs increased linearly over the range of sample sizes 
selected for this study, 3) the number of singletons 
approaches a plateau as the sample size is increased, 4) 
different cDNA libraries have significantly different 
numbers of unique sequences for the same sample size, 
5) pooling samples from different cDNA libraries 
increases the number and percentage of unique sequences 
for the same sample size. Unique sequences are defined 
as the number of contigs plus the number of singletons. 
Although this measure of unique sequences is commonly 
used, it is affected greatly by the number of singletons, 
which was observed to vary with library and sample size. 
Although useful, a better measure of uniqueness needs to 
be obtained. 
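
The uniqueness measure defined in this abstract (contigs plus singletons) can be written as a small calculation. The sketch below is illustrative only: the function name and the sample counts are hypothetical and are not data from the study.

```python
def uniqueness(num_reads, num_contigs, num_singletons):
    """Return (unique, percent_unique, percent_singletons) for one sample.

    'unique' follows the definition in the abstract:
    number of contigs plus number of singletons.
    """
    if num_reads <= 0:
        raise ValueError("num_reads must be positive")
    unique = num_contigs + num_singletons
    percent_unique = 100.0 * unique / num_reads
    percent_singletons = 100.0 * num_singletons / num_reads
    return unique, percent_unique, percent_singletons


# Hypothetical counts for a sample of 1000 reads (not from the study).
unique, pct_unique, pct_single = uniqueness(1000, 150, 400)
print(unique, pct_unique, pct_single)
```

Because singletons enter the sum directly, any change in the singleton count shifts the measure one for one, which is the sensitivity the abstract points out.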



Automation of Processes in a Core DNA 
Sequencing Facility 

[illegible] P. Williams 1, Yvette C. Clancy 1, Melissa T. 
[illegible], Kevin J. Laddison 1, Alison E. Maurice 1, David 
[illegible], Ke Xu 1, Michael Polchanmoff 2, Judith A. 
[illegible], Howard D. Cash 4, Beth A. [illegible], 1 Pfizer, 
Groton, CT, 2 Visual Technologies, Portland, CT, 3 PE 
Informatics, Foster City, CA, 4 Gene Codes, Ann Arbor, 
Michigan 

Several of the processes in a core DNA Sequencing 
facility have been optimized to accommodate the 
increased throughput made possible with the 3700 DNA 
Analyser. Improvement of several processes will be 
reported: sample submission, reaction set up, data 
management and DNA Analysis. Both in-house 
development and commercial products were considered. 
The Pfizer DNA Sequencing Facility generates and 
analyses DNA sequence from samples submitted by 
scientists site wide. Manual entry of the sample 
information for the 373 and 3700 Data Collection 
Software, PE Biosystems, can take up to an hour a day. 
A web program has been developed for sample 
submission. The submission information can now be 
copied and pasted to populate the Collection Sample 
Sheets. Manual performance of Terminator sequencing 
chemistry is a labor intensive process requiring up to two 
hours of hands on time per day. Several robots have been 
evaluated for their ability to automate terminator 
chemistry. 

The Qiagen 9600 BioRobot, Packard MultiProbe II and a 
custom in-house robot have been considered. The 
increase of throughput made possible with the 3700 DNA 
Analyser has increased the need for an improved data 
management system. A commercial automated data 
management and analysis system, PE Informatics' 
BioLIMS, is being evaluated. BioLIMS is a centralized 
relational database with an open modular architecture. 
BioLIMS can reduce the need for multiple copies of files, 
is searchable and can be set up to automate tasks such as 
data trimming, back-up and archiving. Sequencher for 
BioLIMS is the version of Gene Codes' DNA Analysis 
software package that allows integration with BioLIMS. 
Performance and features of the Sequencher software 
program will be presented. 

Estimation of the Confidence Limits of 
Oligonucleotide Array-Based Measurements 
of Differential Expression 

Paul K. Wolber, Andrew S. Atwell, Cynthia Y. 
Enderwick, Glenda C. Delenstarr, Andreas N. Dorsel, 
Karen W. Shannon, Robert H. Kincaid, Chao Chen, Shad 
R. Schidel & Michael P. Aschoff, Agilent Technologies, 
Palo Alto, CA 

Microarrays of oligonucleotide probes can be used to 
simultaneously infer the differential expression states of 
many mRNAs in two samples. Such inferences are 
limited by systematic and random measurement errors. 
Systematic errors include signal gradients, imperfect 
feature morphologies, mismatched sample concentrations, 
cross-hybridization and scanner bias. Random errors 
arise from chemical and scanning noise, particularly for 
low signals. We have used a combination of two-color 
labeling (with fluor exchange) and rational array design to 
minimize systematic errors from gradients, imperfect 
features and mismatched sample concentrations. On-array 
specificity control probes and careful probe design were 
used to correct for cross-hybridization. Random errors 
were reduced via automated bad feature flagging and an 
advanced scanner design. We have scored feature 
significance using established statistical tests. We have 
then estimated the intrinsic random measurement error as 
a function of average probe signal via sample 
self-comparison experiments (human K-562 cell mRNA). 
Finally, we have estimated the accuracy of differential 
expression measurements between K-562 cells and HeLa 
cells by evaluating the consistency with which different 
probes to the same mRNA measure differential 
expression. The data establish the importance of the use 
of sensitive probes and the elimination of systematic 
errors in producing reliable estimates of differential 
expression. 

P-183 

Negative Selection of Intact mRNA for Full- 
Length cDNA Library Construction 

Ning Wu, Troy Moore, Shannon Wang, MaryAnn 
Taylor, Dewight Cowley II, Mandy B. Hammons, Batty 
S. Pitts, Leslie A. Crow, Monaz V. Baria, Jeneco S. 
Thomas, James R. Hudson, Jr, Research Genetics, Inc., 
Huntsville, AL 

In the process of generating a full-length cDNA library, the 
selection of intact mRNA is the essential step. Many 
methods have been developed focusing on the 
manipulation of the intact 5' end cap structure of the 
mRNA (i.e. "Oligo-capping", "Cap trapper", etc.). We 
have developed a novel method for intact mRNA 
selection based on the elimination of uncapped RNAs. A 
negative selection strategy that removes both uncapped 
mRNA and other non-mRNA molecules that present a 
phosphate at the 5' end has been applied in the mRNA 
purification procedures. A biotinylated 
oligoribonucleotide (r-oligo) is ligated to the 5' end 
phosphate by utilizing T4 RNA Ligase. By using 
streptavidin extraction and phenol/chloroform 
purification, all truncated mRNA and non-mRNA are 
removed from the intact mRNA. We have applied this 
methodology in construction of a mouse brain cDNA 
library. The sequence analysis of 137 clones revealed 
that 15% of clones displayed no matches in both "NR" 
and "dbEST" databases. 47% of the clones are known 
genes and within them, the full length clones total more 
than 68%. The 5' ends of the known genes analyzed range from 
-305 to +196. Further sequence data are under analysis. 

P-184 

Dynamically Organizing Gene Family 
Research Literature for Arabidopsis thaliana 

Dongying Wu, Daniel Haft, Maria-Ines Benito, Owen 
White, The Institute for Genomic Research, Rockville, 
MD 

Whole genome-scale gene family analysis is one of the 
most important approaches for understanding protein 

Bioinformatics of the sugarcane EST project 

Guilherme P. Telles*, Marilia D. V. Braga, Zanoni Dias, Tzy-Li Lin, Jose A. A. Quitzau, 
Felipe R. da Silva and Joao Meidanis 



Abstract 

The Sugarcane EST project (SUCEST) produced 291,904 expressed sequence tags (ESTs) in a consortium that involved 74 sequencing 
and data mining laboratories. We created a web site for this project that served as a 'meeting point' for receiving, processing, analyzing, 
and providing services to help explore the sequence data. In this paper we describe the information pathway that we implemented to 
support this project and a brief explanation of the clustering procedure, which resulted in 43,141 clusters. 



INTRODUCTION 

The application of expressed sequence tag (EST) 
technology has proven to be an effective tool for gene discovery 
(Adams et al., 1991), gene mapping (Schuler, 1997) 
and the generation of gene expression profiles (Boguski 
and Schuler, 1995). 

EST projects are usually conducted by a single labo- 
ratory, which prepares the cDNA libraries, isolates and se- 
quences clones, analyzes the data and submits it to 
GenBank. However, the Sugarcane EST project (SUCEST) 
involved the cooperation of 24 sequencing laboratories, a 
bioinformatics laboratory, a coordinating laboratory, 50 
data mining groups scattered throughout Brazil and an in- 
ternational relations group. A new Brazilian bioinformatics 
group also became associated with the project during a later 
phase. Starting early in 1999, in 15 months the SUCEST 
project generated 291,904 sequences from 260,352 clones 
from 37 different libraries. 

Brazilian genome research has been consor- 
tium-based since its first project, the sequencing of the 
complete genome of the phytopathogenic bacterium 
Xylella fastidiosa (Simpson et al., 2000), conducted by the 
Organization for Nucleotide Sequencing and Analysis 
(ONSA network). Although a consortium-based genome 
project provides a larger number of researchers, technicians 
and sequencing machines, it demands a much more organized 
data flow. In the SUCEST project, the Bioinformatics 
Laboratory (Laboratorio de Bioinformatica - LBI) was re- 
sponsible for receiving data from a network of sequencing 
laboratories, assessing quality, storing and clustering the 
data, and providing many other services. In this paper these 
tasks are described in some detail and quantitative figures 
from the project are given. 



METHODS 

Computational systems 

For a short time in the beginning of the project, the 
SUCEST web site was hosted by a personal computer with 
128 MB of memory running the Linux operating system 
(Red Hat 6.2) but now the site resides on a Compaq 
AlphaServer ES40 with two Alpha 667 MHz processors, 8 
GB of RAM and 384 GB of hard-disk storage space running 
the OSF-1 operating system version 4.0G. However, the 
bulk of the project was executed on a Compaq AlphaServer 
DS20 with two Alpha 500 MHz processors, 4 GB of RAM 
and 144 GB of hard-disk storage space running OSF-1 ver- 
sion 4.0F. Since this was the system on which most of the 
tools were developed we will concentrate on it for the rest 
of the paper. 

The Web engine server is Apache (www.apache.org) 
version 1.3.9. Programs were written in Perl version 5.005 
(www.cpan.org), and PHP version 3.0.12 (www.php.net). 
The database management system is MySQL version 
3.22.26a (www.mysql.com). 

Input data consisted of data received through web 
forms, including chromatograms produced by ABI 377 se- 
quencing machines (Applied Biosystems), and data mining 
reports in HTML format. 

The base calling and sequence extraction programs 
used were phred version 0.980904.e (www.phrap.org) and 
phd2fasta version 0.990622.d (www.phrap.org). The se- 
quence comparison programs used were cross-match ver- 
sion 0.990319 (www.phrap.org) and blastall version 
09/19/1999 (www.ncbi.nlm.nih.gov) that implements the 
BLAST algorithm (Altschul et al., 1997). Assembly pro- 
grams were phrap version 0.990319 (www.phrap.org) and 
CAP3 (Huang and Madan, 1999). Off-the-shelf scripts 
were used to provide search by keywords in the reports 
produced by data mining groups, database administration and 
other minor tasks. Each piece of software used is either free 
for academic purposes or was developed by our team. 

Computational methods 

From a computational point of view, SUCEST may 
be seen as a large data repository and as a provider of 
Internet-based services for a community of different users. 
Figure 1 shows the major relationships between users, ser- 
vices, data and programs in the project. 

There are several types of users: members of sequenc- 
ing laboratories who submit chromatograms from clone li- 
braries, members of data mining laboratories who perform 
searches on the project database and publicize their results 
in data mining reports, and members of the project coordi- 
nation team who monitor the status of the project and the 
distribution and validation of control plates. These users in- 
teract with data through services that add to, retrieve from, 
and update the data repositories. 

Data include sugarcane ESTs, information about pro- 
ject members, data mining reports, control data, summaries 
and the output from programs that perform automated 
searches in databases, organize the sequences into clusters 
and the clusters into categories. In the following para- 
graphs we describe the users, data, and SUCEST services 
and programs, showing how they interact. 

DEFINITIONS 
Objects 

In the SUCEST project data is stored in two different 
kinds of repositories: operating system directories and a 
relational database. The directories hold biological sequence 
files, results from BLAST and cross-match searches in 
biological databases, and data mining reports. Biological 
sequence files include chromatograms, files in a standard 
format called fasta format (www.ncbi.nlm.nih.gov/BLAST/fasta.html), 
quality files, and files generated by clustering, categorization 
and comparative genomics procedures. The project uses only one 
relational database, with several interconnected tables that 
store other biological and management data, e.g. libraries, 
sequencing plates and data on laboratories and their members. 
The database also points to data in directories. The major 
entities (objects) in our database are described below, where 
we also introduce quantitative figures and details from the 
project's pipeline. 

Figure 1 - Major relationships between users, services, data and programs 
involved in the SUCEST project. Arrows indicate the flow of information. 
(Diagram layers: Users; Services; Data repositories; Programs.) 

Laboratories 

There are 78 laboratories involved in the SUCEST 
project that belong to one or more of five groups: the DNA 
Coordination Group, the Bioinformatics Group, the Data 
Mining Group, the Sequencing Group and the International 
Cooperation Group. Each participating laboratory is identi- 
fied by a two-letter code. The services and data that a mem- 
ber of a particular laboratory can access depend on the 
group to which the laboratory belongs. A member of each 
laboratory is designated as being the head of the unit in- 
volved in SUCEST-related work and receives notification 
of some of the activities performed by the laboratory mem- 
bers. 

Members 

A SUCEST member is a person who belongs to at 
least one laboratory. Several members belong to both a se- 
quencing laboratory and a data mining laboratory. Data 
held on members include their name, the laboratories to 
which they belong, their e-mail address, phone numbers 
and a login name and password to grant access to authorized 
services. SUCEST had 256 members as of March 25, 
2001. 

Libraries 

The ESTs included in the SUCEST database came 
from 37 different libraries prepared from different sugar- 
cane tissues under different conditions (Vettore et al., 
2001). The name and description of the library and vector 
employed in cloning were recorded for each library. Each 
library received a two-letter code indicating the tissue from 
which the library was derived, together with a consecutive 
number assigned for every new library derived from the 
same tissue. For example, LR1 indicates that the library 
came from leaf roll (LR) with long inserts (library 1) while 
LR2 shows that the library came from leaf roll (LR) with 
small inserts (library 2). There are three possibilities for the 
status of each library: 'test' for validating libraries, 'start' 
for libraries released for sequencing and 'stop' when the 
DNA Coordination Group decides it is not worth 
continuing to sequence a distributed library. Of the 37 libraries 
prepared for the project, 32 were started and 5 were 
abandoned after the 'test' phase. Those not formally started 
either produced too much redundancy or very small reads. 

Plates 

SUCEST clones are organized in 96-well plates that 
hold clones from the same library in an 8 x 12 grid. Se- 
quencing is done for a whole plate and the data is sent to the 
LBI for processing and storage. Data for a plate include the 
library that it came from and the laboratory that is autho- 
rized to send data on this plate. A plate has a three-digit 
identification tag, except for control plates (see below), 
which have the letter 'C' and two digits. The SUCEST database 
holds data from 2,771 different plates. 

Reads 

Reads are the same as ESTs and are extracted using 
the phred program from chromatograms submitted by the 
sequencing laboratories and screened for vectors with the 
cross-match program. All reads are stored in directories as 
chromatogram files and also as a pair of text files holding 
the sequence and its quality in fasta format. For every read 
the following attributes are stored in the database: the plate 
and the position on the plate where the read came from; in- 
formation about the submission process (e.g. date and time 
of submission); the number of vector and non-vector bases 
with phred quality equal to or higher than 20; the number of 
vector and non-vector bases with phred quality less than 20; 
the starting and ending positions for every vector sequence 
identified in the read and whether or not the read has relevant 
data (see 'Preparation sheet' below). 

Every read has a name that is a concatenation of its 
laboratory, library and plate codes, plate position and read 
direction (5' or 3'). For example, reading from right to left, 
the string SCACAD1001A01.g is the name for the 5' read 
(3' reads use .b as a suffix) of the clone in well A01 of plate 001 
of library AD1, sequenced by laboratory AC. The prefix SC 
stands for sugarcane. Every position on the plate is 
identified by its row (A to H) and column (01 to 12). 
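
The naming convention can be checked mechanically. The following is a minimal sketch, not the project's own code: the regular expression assumes a single-digit library number and that the plate tag is either three digits or 'C' plus two digits, as described under 'Plates'. The function name is hypothetical. The sample name is built from the codes in the example (laboratory AC, library AD1, plate 001, well A01, 5' read).

```python
import re

# Assumed layout: "SC" + 2-letter laboratory + library (2 letters + 1 digit)
# + plate tag (3 digits, or "C" + 2 digits for control plates)
# + well (row A-H, column 01-12) + ".g" (5' read) or ".b" (3' read).
READ_NAME = re.compile(
    r"^SC(?P<lab>[A-Z]{2})(?P<library>[A-Z]{2}\d)"
    r"(?P<plate>\d{3}|C\d{2})"
    r"(?P<well>[A-H](?:0[1-9]|1[0-2]))\.(?P<suffix>[gb])$"
)


def parse_read_name(name):
    """Split a read name into its components, or raise ValueError."""
    match = READ_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid read name: {name!r}")
    return {
        "lab": match.group("lab"),
        "library": match.group("library"),
        "plate": match.group("plate"),
        "well": match.group("well"),
        "direction": "5'" if match.group("suffix") == "g" else "3'",
    }


print(parse_read_name("SCACAD1001A01.g"))
```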

Preparation sheet 

Before a laboratory can sequence and submit a plate, 
it must provide a sheet of information about the process 
used to prepare the plate. There are records in the database 
for every well where bacteria did not grow and for the wells 
from which it was not possible to obtain DNA. Every well 
marked as a problem corresponds to a sequence without in- 
formation relevant to the project. 

Control plates 

For every set of 12 plates a control plate is built using 
the 8th column of each controlled plate, so 12 columns make 
one control plate that is sequenced. The sequences from 
both control and controlled plates are compared against 
each other using cross-match, and the matches are stored in 
the database. A criterion, based on the distribution of matches 
over the control and controlled plates, was established to 
automatically mark plates that probably had tracking and 
naming errors due to plate preparation and sequencing 
processes. Distributions of matches could be visualized via a web 
service, and plates with problems could be fixed and 
resubmitted by the laboratory that produced them. 
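
The construction of a control plate can be sketched as a mapping from control-plate wells to the wells they were copied from. This is an illustration under an assumption the text does not state: that the column taken from the i-th controlled plate is placed in column i of the control plate. The function name is hypothetical.

```python
ROWS = "ABCDEFGH"


def build_control_plate(plate_ids, source_column=8):
    """Map each control-plate well to its (plate, well) of origin.

    Assumes the column from the i-th controlled plate lands in
    column i of the control plate (ordering is an assumption).
    """
    if len(plate_ids) != 12:
        raise ValueError("a control plate covers exactly 12 plates")
    layout = {}
    for column, plate_id in enumerate(plate_ids, start=1):
        for row in ROWS:
            control_well = f"{row}{column:02d}"
            layout[control_well] = (plate_id, f"{row}{source_column:02d}")
    return layout


plates = [f"{n:03d}" for n in range(1, 13)]  # plates 001 to 012
layout = build_control_plate(plates)
print(layout["A01"], layout["H12"])
```

With such a mapping, each control read is expected to match the read from its well of origin; matches elsewhere suggest a tracking or naming error.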

Clusters 

SUCEST reads are grouped by the clustering proce- 
dure described below, which creates sets of aligned reads 
that we call clusters. In our database we store the reads that 
are part of each cluster. Moreover, in addition to being a set 
of reads, a cluster has an alignment and a consensus se- 
quence. Alignments, consensus sequences, and quality files 
are stored in cluster directories. A cluster also has a name, 
which is equal to the name of the oldest read in the cluster. 

Services and programs 

Data enter and are retrieved from the SUCEST data 
repository through a set of services available on web pages 
hosted at LBI. Data is also generated within the LBI by pro- 
grams that are executed either automatically or manually. 
Brief descriptions of these services and programs are presented 
below and provide a general overview of how the 
SUCEST web site is organized and how it works. 

Data retrieval 

Data is retrieved from the SUCEST database in units 
called 'objects' which are the same as the data entities de- 
scribed above under 'Definitions'. Each object has its own 
web page containing information about the object and links 
to any other object, service or report directly related to it. 
Starting from a laboratory or library object it is possible to 
reach the web page of any other object. Some objects point 
to pages that include data extracted from the directory 
structure of the project. For instance, one can visualize 
reads and their qualities in many versions: immediately after 
submission but before screening, after screening but before 
trimming (see below under 'Clustering and Trimming') 
and after trimming. For clusters, it is possible to see the 
reads in a cluster and their alignments, including the con- 
sensus. 

An object search service was created to allow direct 
access to any object. Given the code and the type of the ob- 
ject, the service delivers its page. For the 'Member' object 
type it is possible to search by name, email, department, 
city or institution. 

Besides objects, some reports that summarize data are 
also available for the project: the Summary of Submitted 
Reads gives totals per laboratory or per library of submit- 
ted, payable and clusterizable reads, and the Summary of 
Control Plates gives the totals of accepted and rejected 
control plates. 

SUCEST database users who are SQL (Structured 
Query Language) literate may take advantage of a service 
that allows generic queries to the database. Queries can be 
typed in a web form and the results are returned in tabular 
fashion. Entity-relationship diagrams and table descrip- 
tions for our database are available to help users in this task. 

Sequence submission 

Only sequencing laboratories submit sequences. The submission 
process requires the user to access the project's web site 
using a valid login/password pair and upload a set of 96 
chromatograms (i.e. one plate). When an upload finishes, 
certain pre-requisites are verified: all chromatograms must 
belong to the same plate, the laboratory that is trying to 
submit a plate must be the one authorized to do so, the 
preparation sheet for that plate must have already been 
submitted and the reads must be in accordance with 
the naming conventions. 
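
The four checks listed above can be expressed as one validation pass over the uploaded file names. This is a hedged sketch rather than the project's implementation: the name pattern assumes a single-digit library number, and the function and parameter names are hypothetical.

```python
import re

# Assumed read-name pattern (see the naming convention under 'Reads').
NAME = re.compile(
    r"^SC([A-Z]{2})([A-Z]{2}\d)(\d{3}|C\d{2})[A-H](0[1-9]|1[0-2])\.[gb]$"
)


def check_submission(file_names, submitting_lab, authorized_lab,
                     prep_sheet_received):
    """Return a list of problems; an empty list means the upload passes."""
    problems = []
    plates = set()
    for name in file_names:
        match = NAME.match(name)
        if match is None:
            problems.append(f"name violates the convention: {name}")
        else:
            # A plate is identified here by (library, plate tag).
            plates.add((match.group(2), match.group(3)))
    if len(plates) > 1:
        problems.append("chromatograms belong to more than one plate")
    if submitting_lab != authorized_lab:
        problems.append("laboratory is not authorized to submit this plate")
    if not prep_sheet_received:
        problems.append("preparation sheet has not been submitted")
    return problems


names = [f"SCACAD1001{row}{col:02d}.g" for row in "ABCDEFGH"
         for col in range(1, 13)]
print(check_submission(names, "AC", "AC", True))
```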

If the pre-requisites are satisfied, the phred and 
phd2fasta programs are used to extract the sequences and 
their qualities in fasta format from chromatograms and the 
cross-match program is used to mask vector sequences in 
the reads. These steps take only a few minutes (this time has 
essentially been constant during the project because the 
analysis done upon submission does not depend on the 
other reads present in the repositories). 

After submission analysis, a report that summarizes 
the process and the sequences received is presented to the 
submitter who is asked to confirm the submission or not. If 
the submission is confirmed, the database is updated and if 
there is an older version of the plate it is replaced. 
Directories are updated as well. If the submission is not 
confirmed (e.g., if the submitter is not happy with the quality 
assessment) the submission is discarded. 

Figure 2 shows the path followed by a read in the LBI, 
starting from the submission. The submission procedure 
corresponds to the part of the figure starting at 'Zip file', 
extending through top line and reaching the 'Report Gener- 
ator'. Other steps in the diagram are performed by pro- 
grams described in the following sections. 

Clustering and trimming 

Clustering of ESTs is important to reduce the amount 
of sequence data that miners have to look at, and to organize 
the reads in a less redundant set. In the SUCEST database, 
clustering also served to estimate the level of redundancy 
in the libraries. 

Early on two pivotal decisions were made, the first 
being that each cluster should reflect a transcript rather than 
a gene, allele or other biological entity while the second 
was that a cluster consists not only of a set of reads but also 
of an alignment of these reads. 

In this context, our first scheme was to group similar 
transcripts and to produce consensus sequences using the 
assembly program phrap. This strategy was sufficient in the 
early stages of the project but, as data accumulated, a series 
of problems forced us to change the scheme, as described 
below. 

To minimize artifacts, reads were trimmed before 
clustering. This trimming procedure started with vector 
masking using the cross-match program followed by re- 
moval of some of the poly-A, vector and adapter regions. A 
quality trimmer was also applied, removing bases from the 
ends of the sequence one by one until there were at least 12 
bases with phred quality above 15 in a window of 20 bases 
at the end. Reads were also checked for contamination 
against Xylella fastidiosa, Xanthomonas citri, Escherichia 
coli and other potential contaminants that could be present 
in the laboratories that produced the libraries. BLAST was 
used to compare the reads with the potential contaminants, and 
if a match of at least 100 bases and more than 90% identity 
occurred the read was marked as probably being due to 
contamination. However, marked reads were kept in clustering 
and subsequent analyses to allow data miners to decide for 
themselves whether or not a specific read was contaminated. 
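
The quality trimmer described above can be sketched as follows. This is one possible reading of the rule, not the project's code: bases are dropped from each end until the 20-base window at that end holds at least 12 bases with quality above 15. Returning an empty range when fewer than 20 bases remain is an assumption, and the function name is hypothetical.

```python
def trim_low_quality_ends(quals, window=20, min_good=12, threshold=15):
    """Return (start, end) so that quals[start:end] is the kept region."""

    def acceptable(win):
        # Count bases whose quality is strictly above the threshold.
        return sum(q > threshold for q in win) >= min_good

    start, end = 0, len(quals)
    # Drop bases one by one from the left end.
    while end - start >= window and not acceptable(quals[start:start + window]):
        start += 1
    # Drop bases one by one from the right end.
    while end - start >= window and not acceptable(quals[end - window:end]):
        end -= 1
    if end - start < window:
        return 0, 0  # assumption: too little sequence survives
    return start, end


# Hypothetical quality values: poor ends around a good core.
quals = [5] * 10 + [30] * 50 + [5] * 10
start, end = trim_low_quality_ends(quals)
print(start, end)
```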

Trimmed reads were assembled using the phrap program 
with quality and stringent arguments (-penalty -15 
-bandwidth 14 -minscore 100 -shatter_greedy). Every 
contig and singlet produced by phrap was taken as a cluster. 
As new plates came in, a program automatically updated 
the database, directories and BLAST results for every clus- 
ter that changed and was already in the database. Initially, 
clustering was performed every day but as the set of se- 
quences grew the updates became sparser, running once a 
week. In the final phases of the project, clustering would 
typically occupy an entire processor for about 20 hours. 

The last assembly done with phrap included 261,609 
trimmed reads and produced 81,223 clusters. However, 
changes were made due to remarks made by several mem- 
bers of the project that the total number of clusters in the da- 
tabase was unreasonably large, that many clusters were 
malformed and that some clusters appeared as if they could 
be combined. These changes are described in detail by 
Telles and da Silva (2001). The new scheme was based on 
careful testing and evaluation, and consisted of a more 
elaborate trimming procedure and the use of the CAP3 assembler 
(Huang and Madan, 1999), which is the same tool used to 
produce TIGR's gene indices (Quackenbush et al., 2000). 
Trimming in this new procedure included ribosomal RNA 
removal, comprehensive removal of poly-A, poly-T, vector 
and adapter regions and improved low-quality-end 
trimming. CAP3 was fed with 237,954 reads and their quality 
data and produced 43,141 clusters. 

Both clustering versions are accessible through the 
project web site, with data from both methods available for 
most services. 



Keyword search 

Keyword search is a service that allows users to 
search for a set of keywords in the header lines of every 
sequence in NCBI's nr, nt and dbEST databases 
(www.ncbi.nlm.nih.gov) that hits any cluster in SUCEST. To perform a 
query the user gives a database name (nr, nt or dbEST), a 
logical expression of keywords (that may include 'or' and 
'and' connectors) and the maximum e-value required (an 
optional parameter which defaults to 1e-5 = 10^-5). The service 
then returns the clusters that have a hit with the expected 
or better e-value, and whose subject heading 
contains words satisfying the logical expression. The re- 
sulting list of clusters is ordered by e-value. 
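The filtering and ordering described above can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record, the nested-tuple form of the logical expression and the function names are hypothetical and are not taken from the SUCEST programs.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit of a cluster against a database entry (hypothetical record)."""
    cluster: str
    header: str
    evalue: float


def matches(expr, header):
    """Evaluate a keyword expression against a subject header.

    expr is either a keyword string or a tuple ('and' | 'or', sub1, sub2, ...).
    """
    if isinstance(expr, str):
        return expr.lower() in header.lower()
    op, *args = expr
    results = [matches(arg, header) for arg in args]
    return all(results) if op == "and" else any(results)


def keyword_search(hits, expr, max_evalue=1e-5):
    """Return (cluster, best e-value) pairs, ordered by e-value."""
    best = {}
    for hit in hits:
        if hit.evalue <= max_evalue and matches(expr, hit.header):
            if hit.cluster not in best or hit.evalue < best[hit.cluster]:
                best[hit.cluster] = hit.evalue
    return sorted(best.items(), key=lambda item: item[1])
```

Keeping only the best e-value per cluster is a design choice of this sketch; the text only states that the resulting list is ordered by e-value.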

A program was created for keeping BLAST results
against nr, nt and dbEST up to date for all SUCEST clusters.
A BLAST result against a certain database is considered
outdated for a SUCEST cluster if the cluster is
newer than the result or if the cluster or the database was
modified after the last BLAST run. When the program finds
outdated BLAST results it builds a queue giving priority to
older clusters. If the databases are on different computers
the system is able to reduce the processing time by running
several BLASTs in parallel (one on each remote server), and
a full update takes about 2 or 3 days. If the databases are on
a single computer, BLAST searches take considerably longer.
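The rule for deciding which BLAST results are outdated, and the queue that gives priority to older clusters, can be sketched as below. The `Cluster` record, the use of plain numbers as timestamps and the function names are assumptions made for illustration; the original program is not described at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster record with creation and modification timestamps."""
    name: str
    created: float
    modified: float


def is_outdated(cluster, db_modified, last_blast):
    """A result is outdated if it is missing, or older than the cluster or the database."""
    if last_blast is None:
        return True
    return (cluster.created > last_blast
            or cluster.modified > last_blast
            or db_modified > last_blast)


def build_queue(clusters, db_modified, last_runs):
    """Return the names of clusters needing a new BLAST run, oldest cluster first."""
    stale = [c for c in clusters
             if is_outdated(c, db_modified, last_runs.get(c.name))]
    return [c.name for c in sorted(stale, key=lambda c: c.created)]
```

One queue would be built per database (nr, nt, dbEST), which is what allows the searches to run in parallel on separate servers.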

Subclustering 
This service is used to evaluate statistics about sub- 
sets of clusters obtained by clustering, including read fre- 
quency by cluster size, total reads, total clusters, 
redundancy and novelty. To select the subset of clusters, 
the user has to indicate the reads that belong to the clusters. 
Any cluster that contains a read in the selection is included 
in the evaluation. To locate reads, one or more elements 
(laboratory, library, plate, position and direction) from their 
names should be selected, e. g. selecting a particular labora- 
tory will generate the statistics for the clusters that have at 
least one read sequenced by that laboratory. 
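A minimal sketch of this selection, assuming each read is represented by a dictionary of its name elements (laboratory, library, plate, position, direction). The data layout and function names are hypothetical; redundancy and novelty are omitted because their formulas are not given here.

```python
from collections import Counter


def select_clusters(clusters, **criteria):
    """Keep every cluster that contains at least one read matching all criteria.

    clusters maps a cluster name to a list of read dictionaries.
    """
    def matches(read):
        return all(read.get(key) == value for key, value in criteria.items())

    return {name: reads for name, reads in clusters.items()
            if any(matches(read) for read in reads)}


def summarize(selected):
    """Total clusters, total reads and number of reads per cluster size."""
    sizes = [len(reads) for reads in selected.values()]
    reads_by_size = Counter()
    for size in sizes:
        reads_by_size[size] += size
    return {
        "total_clusters": len(sizes),
        "total_reads": sum(sizes),
        "reads_by_cluster_size": dict(reads_by_size),
    }
```

Note that a selected cluster is counted with all of its reads, including reads that do not match the criteria, as the text states.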

BLAST search 
A BLAST service allows searches against SUCEST 
reads, reads in their trimmed version and cluster consensi. 
These databases were updated automatically on a daily ba- 
sis to incorporate new reads and consensi. 
Data mining report

Data mining groups submit HTML formatted reports 
to the SUCEST site and update them periodically. Users
may access reports through an index page that provides ac- 
cess to the reports of every data mining group and a key- 
word search is also available. When a report archive is 
uploaded a service takes care of unpacking the files and up- 
dating the index page and the search index. Information 
about reports is also kept in the SUCEST database, includ- 
ing the name and a summary of the project, its members and 
a submission date and submitter name. 

Categorization 

SUCEST members tried to categorize the clusters in 
the project, in an attempt to determine their function and to 
aggregate information. Thirty categories were defined, and 
32,438 proteins with known function were selected from 
public databases to serve as examples in each category. 
Public databases included MIPS Arabidopsis thaliana data- 
base (mips.gsf.de), Clusters of Orthologous Groups - func- 
tional annotation (www.ncbi.nlm.nih.gov/COG/), EGAD 
cellular roles (www.tigr.org/docs/tigr-scripts/egad_scripts 
/rolereport.spl) and others. 

Categorization was achieved by two methods: auto- 
matic and manual. In automatic categorization a database 
was constructed containing the proteins selected from pub- 
lic databases and a BLAST search was performed against 
this database using SUCEST clusters as input. Any cluster 
was considered to be in category X if it matched a category
X example protein with an e-value better than or equal to
10^-10 and covered 70% or more of the example. A cluster
could be in many different categories. This method categorized
36% of the 43,141 clusters. For manual categorization
a web service was built to allow manual annotation when 
automatic annotation produced ambiguous categorization 
or produced no categorization at all. Based on BLAST re- 
sults against the nr database, SUCEST members were able 
to establish a direct relation between a cluster and a category.
Manual annotation significantly increased the number
of categorized clusters and, as of March 20th, 2001,
60.5% of the clusters were categorized.
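The automatic categorization rule can be sketched as follows. The tuple layout of a hit and the function name are hypothetical, the e-value cut-off is read here as 10^-10, and coverage is assumed to mean aligned length divided by the length of the example protein.

```python
def categorize(hits, max_evalue=1e-10, min_coverage=0.70):
    """Assign categories to clusters from BLAST hits against example proteins.

    Each hit is a tuple:
    (cluster, category, evalue, aligned_length, example_length).
    A cluster may end up in several categories.
    """
    categories = {}
    for cluster, category, evalue, aligned_length, example_length in hits:
        coverage = aligned_length / example_length
        if evalue <= max_evalue and coverage >= min_coverage:
            categories.setdefault(cluster, set()).add(category)
    return categories
```

Clusters absent from the returned dictionary are the ones that would be passed on to manual categorization.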

Comparative genomics 

To obtain information on sugarcane and its relation- 
ship to other species, SUCEST cluster consensi were com- 
pared against other organisms. The first organism selected 
for comparison was the model plant Arabidopsis thaliana. 
Every cluster consensus was BLASTed against A. thaliana 
chromosomes, proteins and ESTs. Clusters that produced 
no matches against A. thaliana were also BLASTed
against ESTs from Lycopersicon esculentum, Glycine max,
Lotus japonicus, Hordeum vulgare, Oryza sativa, Sorghum
bicolor, Zea mays, Triticum aestivum and Medicago
truncatula. Results from these searches were inserted in our
database, allowing queries to determine the distribution of
these hits per library, per cluster, or by some other grouping
criterion.

Management 

These services provide a way for the DNA Coordina- 
tion Group to input management information into the 
SUCEST database. This information is used mainly by ser- 
vices that perform checking and summarizing operations. 
Using the library management services, the DNA Coordi- 
nation Group modifies the status of any library and assigns 
plates to sequencing laboratories. Manual plate approval is 
also possible via a service that displays control and con- 
trolled plates showing which cells match in control and 
controlled plates. 

DISCUSSION 

A key aspect of the project was the close interaction 
between the biological laboratories and the LBI. Discus- 
sion lists or telephone calls were used so that users could 
give suggestions for new services and quickly point out 
problems with the services (broken links, bugs, etc.). This
daily, intensive interaction was undoubtedly one of the 
main reasons for the success of the project. 

Clustering started early and had a dramatic impact 
during the project. Re-clustering on a regular basis de- 
manded designing and implementing programs to update 
databases and BLAST results against the nr, nt and dbEST 
databases, and also used a lot of processor time. When another
clustering scheme was adopted, the web site had to
change to accommodate both versions simultaneously and
to show relationships between clusters in different versions,
and both bioinformatics and data mining staff needed some
time to adapt to the changes.

The two most important lessons learnt during the 
SUCEST project were 'avoid changing systems' and 'keep 
reference sequences, not cluster lists' which we will discuss 
in more detail in the following paragraphs. 

Avoiding changes in the systems is important. During 
this project we had to change the underlying computing 
system twice, the first time from a personal computer to a 
medium-sized server and then from this to a larger server. 
These changes caused many problems, e.g. programs that 
used to work on one system would not work on the other 
system, users had to get used to new addresses etc. The mi- 
gration process proved time-consuming and error-prone. 
Our advice would be to set up a system that is big enough 
right from the start and keep the project there for as long as 
possible. To minimize the impact of migration it is important
to devise the directory structure in a system-independent
way. For instance, data can be placed in directories
that will not conflict with system directories, and programs
can be installed in standard locations, with execution path
variables used to ensure they will work. Another important
piece of advice is to use software that combines many physical
disks into one big volume of, say, a few hundred gigabytes.
Most vendors provide such software for a small fee.

It is also important to keep reference sequences in- 
stead of lists of clusters. In this project, data accumulated at 
a fast rate and clustering was redone frequently. Some data 
mining groups had problems trying to keep up with the fre- 
quent updates because they maintained lists of relevant 
clusters. Each time the clustering was redone some clusters 
would disappear (merge into larger ones) or the read composition
of a cluster would change, requiring a lot of manual
labor. Our advice would be to use reference sequences
from GenBank or another stable sequence database, which
can then be used as queries to retrieve the cluster lists via
BLAST. In this way, lists can be quickly reconstructed
from the reference sequences using automated
methods.
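As an illustration of rebuilding cluster lists from reference sequences, the sketch below parses BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The function name and the e-value cut-off are assumptions; the reference sequences are the queries and the cluster consensi are the subjects.

```python
def clusters_for_references(tabular_lines, max_evalue=1e-5):
    """Map each reference sequence (query) to the set of clusters (subjects) it hits."""
    result = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            result.setdefault(query, set()).add(subject)
    return result
```

Running the same reference sequences against each new clustering version regenerates the list without any manual bookkeeping of cluster names.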

There are many other programs, not presented here, 
that contribute to the functionality of the SUCEST web site. 
Some services and programs have already been disabled 
(e.g. the sequence submission and plate control programs) 
but others, such as the keyword search, BLAST and report 
submission programs are still being used by data mining 
laboratories and will be used by the international commu- 
nity when the web site goes public. This will certainly 
transform the meeting point of the project's community 
into the meeting point of a wider group which will produce 
new demands for services and data storage. 
SUMMARY

The SUCEST project (Sugarcane EST Project) produced
291,904 sugarcane ESTs. In this project, the
Bioinformatics Laboratory created the web site that was the
"meeting point" of the 74 sequencing
and data mining laboratories that took part in the project
consortium. The Bioinformatics Laboratory (LBI) received,
processed and analyzed the data and made tools available for
exploring them. This article describes the data, services and
programs implemented by the LBI for the project,
including the clustering procedure that generated
43,141 clusters.
The libraries that made SUCEST

Andre L. Vettore, Felipe R. da Silva, Edson L. Kemper and Paulo Arruda



Abstract 

Large-scale sequencing of sugarcane expressed sequence tags (ESTs) was carried out as a first step in depicting the genome of this
important tropical crop. Twenty-six unidirectional cDNA libraries were constructed from a variety of tissues sampled from thirteen
different sugarcane cultivars. A total of 291,689 cDNA clones were sequenced in their 5' and 3' end regions. After trimming low-quality
sequences and removing vector and ribosomal RNA sequences, 237,954 ESTs potentially derived from protein-encoding messenger
RNA (mRNA) remained. The average insert size in all libraries was estimated to be 1,250 bp, with the insert length varying from 500 to
5,000 bp. Clustering the 237,954 sugarcane ESTs resulted in 43,141 clusters, of which 38% had no matches with existing sequences in
the public databases. Around 53% of the clusters were formed by ESTs expressed in at least two libraries, while 47% of the clusters were
formed by ESTs expressed in only one library. A global analysis of the ESTs indicated that around 33% contain cDNA clones with a
full-length insert.



INTRODUCTION 

Single-pass sequencing of cDNAs to generate "expressed
sequence tags" (ESTs) has proven to be a powerful,
economical and rapid approach to identify genes that are
preferentially expressed in certain tissues or cell types of
multicellular organisms (Adams et al., 1991; Hwang et al.,
1997; Liew et al., 1994; Adams et al., 1995). Increasing importance
has also been attributed to ESTs as a tool for the
annotation of complete genome sequences of mammals
and plants. Unique ESTs have provided biological evidence for
hundreds of predicted genes, newly discovered genes, or
transcript isoforms, leading to considerable advances in gene
identification in multicellular organisms (Andrews
et al., 2000). Today, more than ten million ESTs are
available through the dbEST entry of GenBank
(http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html);
however, only 14% of dbEST release 022301 of February
23, 2001 corresponds to plant sequences.

Another useful aspect of ESTs is in accessing genetic 
information of species with a complex genome, whose ac- 
cess is difficult using conventional genetics. This is the case 
of sugarcane, an important crop that is cultivated in the 
tropics for its high sucrose accumulation in the stalk. 
Among the cultivated crops, sugarcane possesses perhaps 
one of the most complex genomes (for a review see Grivet 
and Arruda, 2002). Modern sugarcane cultivars are hybrids
derived from crosses between Saccharum officinarum, usually
having 2n = 80 chromosomes, and Saccharum spontaneum,
with 2n = 40-128 chromosomes. In view of the structural
differences between chromosomes of the two species, the
hybrids possess different proportions of chromosomes,
varying chromosome sets and complex recombinational
events (Grivet and Arruda, 2002). This imposes tremendous
difficulties in applying conventional plant breeding
techniques to sugarcane.

As a first step in depicting the sugarcane genome, the 
ONSA consortium (Simpson and Perez 1998) launched in 
September 1998 the Sugarcane Expressed Sequence Tag
project (SUCEST), aiming at sequencing random ESTs and 
identifying around 50,000 unique genes (http://sucest.lad. 
ic.unicamp.br/en/). 

To improve the probability of getting a maximum 
number of different ESTs, researchers have been using nor- 
malized and/or subtracted cDNA libraries that bring the 
frequency of each clone in a cDNA library within a narrow 
range (Soares and Bonaldo 2000). However, normalization 
and/or subtraction procedures are in general laborious and 
tend to increase the proportion of small-insert
clones. In the SUCEST project we implemented
an efficient procedure for constructing conventional cDNA
libraries for large-scale EST generation from sugarcane. This
paper describes the construction of these libraries, which
represent all major organs harvested at different developmental
stages and were used to generate one of the largest plant
EST collections.
MATERIAL AND METHODS

Plant material 

Sugarcane tissues were obtained from commercial 
cultivars (Table I) grown at the Copersucar experimental 
station (Piracicaba, SP, Brazil), at the Universidade Federal 
de Sao Carlos experimental station (Serra do Ouro, AL,
Brazil) and at the Centro de Biologia Molecular e
Engenharia Genética (Campinas, SP, Brazil). After harvesting,
tissues were frozen in liquid nitrogen and stored at
-80 °C. 

RNA isolation 

Total RNA was isolated using Trizol (Invitrogen) ac- 
cording to manufacturer's instructions. Due to the high car- 
bohydrate content and the presence of phenolic 
compounds, total RNA from immature seeds was isolated 
according to the method described by Manning (1991). 

Poly(A)+ mRNA was purified from total RNA using
Oligotex-dT (Qiagen) according to the manufacturer's instructions.
Purity and RNA integrity were assessed by
absorbance at 260/280 nm and agarose gel electrophoresis.

cDNA library construction 

Libraries were constructed using the Superscript
cDNA Synthesis and Plasmid Cloning Kit (Invitrogen) according
to the manufacturer's protocols. One microgram of
poly(A)+ mRNA was reverse-transcribed using a poly-dT
primer containing the NotI site. The efficiency of cDNA
synthesis was monitored with radioactive nucleotides. The
second cDNA strand was then synthesized by replacing the
RNA in the hybrids with DNA, using a combination of
RNase H, DNA Polymerase I and DNA Ligase. After
second-strand synthesis and ligation of SalI adapters,
cDNA was digested with NotI, generating cDNA flanked
by SalI sites at the 5' ends and NotI sites at the 3' ends. Excess
adapters were removed and cDNAs were size-fractionated on
a 40 cm long, 1 mm ID Sepharose CL-2B column. Fractions
of 150 µL were collected and 8 µL
aliquots of each fraction were electrophoresed in 1.5%



Table I - Description of the SUCEST libraries.

Library code              Library name             Description

AD1                       G. diazotroficans 1      Mixture of tissues from root to shoot zone, stem and
                                                   apical meristem of plantlets cultivated in vitro and
                                                   infected with Gluconacetobacter diazotroficans
AM1, AM2                  Apical Meristem          Apical meristem of young plants
CL6                       Calli                    Pool of calli treated for 12 h at 4 °C and 37 °C in the
                                                   dark or light
FL1, FL3, FL4, FL5, FL8   Flower 1, 3, 4, 5 and 8  Flowers harvested at different developmental stages
HR1                       H. rubrisubalbicans 1    Mixture of tissues from root to shoot zone, stem and
                                                   apical meristem of plantlets cultivated in vitro and
                                                   infected with Herbaspirilum diazotroficans
LB1, LB2                  Lateral Bud 1 and 2      Lateral buds from mature plants
LR1, LR2                  Leaf Roll 1 and 2        Leaf roll from immature plants
LV1                       Leaf 1                   Etiolated leaves from plantlets grown in vitro
RT1, RT2, RT3             Root 1, 2 and 3          0.3 cm-length roots from mature plants and root apex
RZ1, RZ2, RZ3             Root to shoot            Root to shoot zone of young plants
                          zone 1, 2 and 3
SB1                       Stalk Bark 1             Stalk bark from mature sugarcane plants
SD1, SD2                  Seeds 1 and 2            Developing seeds
ST1, ST3                  Stem 1 and 3             First and fourth internodes of immature plants

Sugarcane variety



P70-1143 3 

SP80-3280 2 
SP8O-3280' 

SP80-87432 1 
PB5211 XP57150-4 1 

SP70-1 143 3 



SP80-3280' 

SP80-3280 1 

SP83-5077 
SP80-185 
SP87-396 
SP80-3280 

SP803280xSP81-544l' 

SP80-3280 1 

SP80-3280 1 



SP80-3280 2 

CB47-89 
RB855205 
RB845298 
RB805028 4 

SP80-3280' 



cDNA libraries were constructed from different tissues sampled from different varieties grown at Copersucar experimental station (Piracicaba-SP) 1,
CBMEG - Universidade Estadual de Campinas (Campinas-SP) 2, Universidade Federal do Rio de Janeiro (Rio de Janeiro-RJ) 3, and Universidade Federal
de Sao Carlos experimental station (Serra do Ouro-AL) 4.
agarose gel to determine the size range of cDNAs. Fractions
containing cDNAs with a minimum size of 500 base pairs
(bp) were pooled and ligated to pSPORT1 vector
(Invitrogen) predigested with SalI and NotI. The resulting
plasmids were transformed into DH10B cells (Invitrogen) by
electroporation. Unamplified libraries were plated and individual
colonies were picked and transferred to 96-well plates
containing liquid Circle Grow (CG) medium (BIO 101)
supplemented with 100 mg/L of ampicillin and 8% glycerol.
Three copies of each cDNA clone were stored at
-80 °C.

Template preparation and DNA sequencing 

DNA template preparations and sequencing reactions
were performed in a 96-well format. Plasmid templates
were prepared using modified alkaline lysis (http://sucest.
lad.ic.unicamp.br). Sequencing reactions were performed
on plasmid templates using a quarter of the standard volume
of ABI Prism BigDye Terminator Sequencing Kit
(Applied Biosystems) and the T7 promoter primer
(5'-TAATACGACTCACTATAGGG-3'), which hybridizes
upstream of the SalI site in the pSPORT1 polylinker (5' end
of the cDNA inserts), or the SP6 promoter primer
(5'-ATTTAGGTGACACTATAG-3'), which hybridizes
downstream of the NotI site (3' end of the cDNA inserts).
Reaction products were precipitated with 95% ethanol using
sodium acetate (3 M) and glycogen (1 g/L) as carriers
and washed twice with 75% ethanol before drying under
vacuum. The sequencing reaction products were analyzed
on 377-96 ABI Sequencers.

Sequence analysis 

Sequencing of sugarcane ESTs was performed by 23 
laboratories located in Universities and Research Institutes 
of the State of Sao Paulo and sequences were processed by 
the Bioinformatics laboratory (LBI) located at Instituto de 
Computacao, Universidade Estadual de Campinas. A de- 
tailed description of the methods used to receive, process, 
analyze, and display the sequences along with additional 
tools to help explore the sequence data can be found in this 
issue (Telles et al., 2001; Telles and da Silva, 2001).

RESULTS AND DISCUSSION 

The SUCEST strategy 

EST programs to acquire information about the transcriptome
have been carried out for hundreds of organisms including
plants and mammals. In most cases,
unidirectionally cloned cDNA libraries have been prepared
using bacterial or phage vectors, so that the 5' and/or 3' end
of the clones can be sequenced. Since single-pass reads yield
on average ~350 high-quality nucleotides, sequencing
3' ends covers mainly the untranslated region of the transcript.
Moreover, the 3' end of cDNA clones contains a
long poly-A tail that carries no biological information
and in general introduces technical difficulties in
the sequencing process. However, because the untranslated
3' end is the least conserved region of a transcript,
it is useful, for example, for avoiding misassembly of reads
coming from highly conserved sequences of members of
gene families. Sequencing the 5' ends of unidirectional cDNA
clones, on the other hand, can be of great advantage for
large-scale EST projects. Since the 5' untranslated region is
shorter, a 5' read is likely to contain protein-coding sequence.
In addition, because a large proportion of clones
contain partial cDNA sequences, it is possible to collect
enough information to assemble the full consensus sequence
of a transcript, increasing the likelihood that database
searches will result in the assignment of putative
functions. Based on these assumptions we decided to sequence
the 5' end of the cDNA clones to build up the SUCEST database.

The libraries 

Table I describes the libraries used in
the SUCEST project. A variety of tissues were sampled
from different cultivars in order to access transcript information
for genes expressed in many biological systems.
Two libraries, AD1 and HR1, were constructed using tissues
from in vitro cultured plantlets infected with Gluconacetobacter
diazotroficans and Herbaspirilum diazotroficans.
These are endophytic nitrogen-fixing bacteria that
colonize sugarcane tissues (Lee et al., 2000). Sequencing
from these libraries could lead to the discovery of genes involved
in plant-bacteria interaction and in nitrogen assimilation
in sugarcane. Libraries AM1, AM2, LB1 and LB2
were constructed using apical meristem of young plants
and lateral buds from adult plants. These libraries should contribute
genes expressed at the initial stages of organ
differentiation. Calli produced from sugarcane meristems
were used in an experiment devised to access genes induced
by cold and heat. Two-week-old calli were incubated at 4 °C
or 37 °C for 12 h. Part of the tissue was maintained in the
dark and part in continuous light. The CL6 library was prepared
with a mixture of equal amounts of RNA extracted
from these tissues, and it is expected that this library will
contribute genes induced by cold and heat. FL1, FL3,
FL4, FL5 and FL8 are libraries constructed from flower tissues
harvested at different developmental stages and may
contribute genes expressed in this important plant organ.
To access information on genes expressed in leaves,
we constructed the LR1 and LR2 libraries from leaf roll of
adult plants and LV1 from etiolated leaves of plantlets
grown in vitro. Libraries representing roots
or tissues from which roots emerge include RT1,
RT2 and RT3, constructed from roots
sampled from plantlets grown in vitro or plants grown in the
greenhouse, and RZ1, RZ2 and RZ3, constructed
from the root to shoot zone of young plants grown in the greenhouse.
SB1 is a library constructed from stalk bark of adult
plants and may contribute genes involved in the synthesis
of cell wall components including waxes. SD1 and
SD2 are libraries constructed from developing seeds.
Finally, we constructed the libraries ST1 and ST3 from the first
and fourth internodes of adult plants at the time of intense
sucrose synthesis and accumulation.

Quality control 

Large-scale sequencing demands care with the quality
of biological materials and accurate performance at each
step of the process, both to provide sequence data of the
highest possible quality and to detect or avoid mistakes
(Adams et al., 1995). At each step of the SUCEST project,
from tissue sampling to sequence analysis, quality control
and evaluation procedures were used to assess the accuracy
of the data. The goal of the SUCEST project was that
cDNA libraries should: contain all sequences present in the
initial poly(A)+ mRNA population, which is useful for assessing
expression profiles through electronic Northern; be unidirectionally
cloned so that the orientation of each cDNA is
known, facilitating subsequent sequence analysis; include a
large proportion of full-length inserts; and show low levels
of contamination with genomic or ribosomal RNA. Table II
shows the quality control steps used during cDNA library
construction and sequencing. Tissues were quickly frozen
in liquid nitrogen, RNA quality was analyzed by different methods
and the cDNAs were synthesized and size-selected
using special gel filtration columns. cDNAs were unidirectionally
cloned in the pSPORT plasmid vector and introduced
into DH10B competent cells. Libraries with a titer of less
than 1 x 10^4 were discarded. Colonies were placed into 96-well
plates and stored at -80 °C. A sample of ~400 clones
from each library was examined to evaluate library quality,
including the percentage of clones with no inserts, the percentage of
ESTs with exact matches to sequences derived from ribosomal
RNA species, E. coli or bacteriophage lambda, the percentage
of ESTs with no significant matches to any
sequence in the public databases, and an estimate of the
number of clusters that contain a full-length coding region
sequence. Libraries selected for EST analysis typically exhibited
a broad diversity of transcripts (no single gene or
small group of genes dominating the distribution), a low
percentage of clones with no insert, a low percentage of ribosomal
RNA clones, and no evidence of contamination
with sequences from other organisms. The libraries that did
not meet these general criteria were discarded.

Sequencing in the SUCEST project was carried out
using ABI377 sequencers, which are prone to error during
gel tracking. To minimize errors, the 8th row of each 96-well
plate was used to build control plates that were resequenced.
Computer analysis was then used to check the address
match. This allowed the SUCEST project to keep
the address error below 5%, so that a sequence in the
computer corresponds, with high fidelity, to a clone in the
freezer.

Table II - Quality control and evaluation of SUCEST libraries.

Parameter                   Quality control and evaluation

Tissue sampling             Tissues snap frozen quickly after harvesting
Poly(A)+ RNA purification   Purity and RNA integrity were assessed by absorbance
                            at 260/280 nm and agarose gel electrophoresis
cDNA synthesis              Tracer levels of 32P used; agarose gel examination for
                            degradation; column chromatography for size selection
cDNA library construction   Blue/white screen for inserts; PCR to check insert sizes;
                            libraries must contain at least 10^5 recombinants
Library storage             All clones were grown in 96-well plates containing CG
                            medium supplemented with 8% glycerol. Plates were
                            stored at -80 °C in triplicate
Sample sequencing           Around 400 clones of each library were sequenced to
                            check gene diversity, contamination and rRNA
Clone address               One clone in each twelve was resequenced to detect
                            putative address mistakes
Template preparation        DNA quality and concentration checked by agarose gels

Quality control procedures for each step in the EST process are listed with
specific points of evaluation or standards to be met.

SUCEST data set 

Table III summarizes the complete data set
of the SUCEST project. A total of 259,325 cDNA clones
were sequenced in their 5' end region and 32,364 of them
also had their 3' end region sequenced. Therefore, the project
produced 291,689 ESTs. After trimming of low-quality
sequences and removal of vector and ribosomal RNA
ESTs, 237,954 ESTs potentially derived from protein-encoding
messenger RNA (mRNA) remained. This
represents a success index of 81.56%, which is comparable
with other EST projects worldwide. Before entering the sequencing
pipeline, each SUCEST cDNA library was evaluated
for the average size of cDNA inserts. cDNA libraries
with an average insert size below 500 bp were discarded.
The average insert size in all libraries was estimated
to be 1,250 bp (n = 4,000) (Table III). Insert
lengths ranged from 500 to 5,000 bp. In order to
clone genes encoding low molecular weight proteins, we
constructed some cDNA libraries (LR2, RZ2 and SD2; see
Table IV) with an average insert size of 855 bp.
Table III - Summary of SUCEST data.

Analyzed data

Total ESTs                                       291,689
5' ESTs                                          259,325
3' ESTs                                           32,364
ESTs remaining after trimming quality control    237,954
Average insert size, bp                            1,250
Average EST length, bp                               750
Average EST bases with Phred quality > 20            365

Numbers of sequenced cDNA clones and generated ESTs from 26 libraries
constructed from different sugarcane tissues. 259,325 ESTs were generated
by sequencing the 5' end of cDNA clones. Another 32,364 ESTs
were generated by sequencing the 3' end of cDNA clones. The average
insert size was calculated for 400 cDNA clones from each library. The
EST length and the number of bases with Phred quality > 20 were calculated
from the total EST set.

After the trimming process, all new sequences were 
compared to the previous sequences that had already been 
deposited in the SUCEST database. Every time that an EST 
was similar to a sequence that already existed in the data- 
base, both were grouped together in a cluster. As noted in 
Table V, the 237,954 valid sequences were assembled into 
43,141 clusters.

Each cluster consensus sequence was compared 
against the non-redundant nucleotide and peptide databases 
(GenBank) using the programs BLASTN and BLASTX. 
Sequences that did not match these databases were further
compared against dbEST. Using a BLAST E-value threshold
(Altschul et al., 1997) equal to or below 1e-5, 26,525 (61.5%) of the
43,141 SUCEST clusters had matches
with an existing sequence in GenBank (Table V). Therefore,
16,616 (38.5%) of the SUCEST clusters could potentially
represent new genes. These values are comparable to
those found for EST sequences from other organisms
(Hwang et al., 2000; Adams et al., 1992; Claverie, 1996). Ascribing
functions to those anonymous sequences has therefore
become one of the major bottlenecks in plant and
animal genomics.
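The percentages quoted above follow directly from the cluster counts, as this short check shows. The function name is hypothetical; the counts are the ones given in the text.

```python
def match_summary(total_clusters, matched_clusters):
    """Percentage of clusters with and without a database match."""
    unmatched = total_clusters - matched_clusters
    return {
        "matched_pct": round(100 * matched_clusters / total_clusters, 1),
        "unmatched": unmatched,
        "unmatched_pct": round(100 * unmatched / total_clusters, 1),
    }


# 43,141 clusters in total, 26,525 of them with a match in GenBank
summary = match_summary(43141, 26525)
```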

Tissue and cellular differentiation depend on specific 
patterns of gene expression. Therefore, in large-scale EST 
sequencing, sampling many different tissues under different
physiological conditions increases the chance of picking up
transcripts that are rare in one cell type but less rare in another.
The SUCEST database was built up with sequences derived
from 26 libraries constructed from different tissues sampled
at different developmental stages (Table I), and an average
of 10,000 clones were sequenced from each library.
Sequencing from many libraries resulted in a novelty ratio
as good as the ratios found in other EST projects that used
normalized libraries (Bonaldo et al., 1996).

Around 53.2% of the SUCEST clusters were formed 
by ESTs expressed in at least two libraries. This suggests 
that these genes are being coordinately expressed in differ- 



Table IV - Characteristics of the SUCEST libraries.

Library  Average insert  Sequenced  Valid    Success    Novelty
code     size (bp)       clones     reads    index (%)  (%)

AD1      1,330           18,144     14,701   81.02      55.34
AM1      1,300           12,480     10,881   87.19      55.05
AM2      -               15,648     13,403   85.65      49.45
CL6      1,150            7,392      5,518   74.65      63.62
FL1      1,400           18,528     15,343   82.81      54.82
FL3      1,340           13,056     10,727   82.16      53.26
FL4      1,370           16,896     13,964   82.65      52.19
FL5      1,180           10,080      7,744   76.83      66.05
FL8      1,400            5,184      4,652   89.74      72.26
HR1      -               12,000      9,729   81.08      52.11
LB1      1,150            7,488      5,879   78.51      62.91
LB2      1,660           10,560      8,953   84.78      60.33
LR1      1,240           14,112     11,701   82.92      56.85
LR2        870            4,128      3,418   82.80      68.13
LV1      1,260            6,432      4,557   70.85      67.32
RT1      1,450            8,640      7,255   83.97      58.26
RT2      1,400           12,288     10,606   86.31      54.86
RT3      1,000           10,560      7,441   70.46      58.54
RZ1      1,290            3,168      2,831   89.36      71.07
RZ2      -                5,760      5,031   87.34      63.14
RZ3      -               15,168     12,862   84.80      50.75
SB1      -               16,320     13,189   80.81      56.16
SD1      1,240           11,040      8,601   77.91      51.84
SD2        840           10,368      8,505   82.03      48.19
ST1      1,050            8,448      6,933   82.07      62.87
ST3      1,350           12,000      8,939   74.49      50.55


The average insert size of each library was determined in a sample of 400
clones by gel electrophoresis of the clones digested with PvuII. Valid reads
are defined as reads containing at least 140 bp with Phred quality > 20. The
success index is the number of valid reads relative to the number of
clones sequenced. Novelty represents the probability of a new sequence
being found in the library.
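
As a quick check of the definition above, the success index can be
recomputed from the clone and valid-read counts in Table IV. This is
only an illustrative sketch; the function name `success_index` is ours
and is not part of the SUCEST pipeline.

```python
def success_index(valid_reads, sequenced_clones):
    """Percentage of sequenced clones that yielded a valid read."""
    return 100.0 * valid_reads / sequenced_clones


# Counts copied from Table IV: (sequenced clones, valid reads).
libraries = {
    "AD1": (18144, 14701),
    "FL8": (5184, 4652),
    "LV1": (6432, 4557),
}

for code, (clones, valid) in libraries.items():
    print(code, round(success_index(valid, clones), 2))
```

The rounded values can be compared directly with the "Success index"
column of the table.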

ent tissues or that they are expressed in response to specific 
physiological conditions or developmental requirements. 
On the other hand, 46.8% of the clusters (Table VI, the sum
of specific contributions) are formed by ESTs expressed in
only one library. This suggests that these ESTs could
correspond to genes expressed in a tissue- or time-specific
fashion, varying with tissue and physiological conditions.
Nonetheless, these data should be analyzed taking into
account that 16,338 (37.9%) are singletons, therefore
representing rare transcripts. The uniformity in the number
of singletons in the different libraries (Table VI)
strengthens the value of the approach adopted.

A global analysis of all SUCEST clusters indicated 
that around 33% contain cDNA clones with full-length in- 
Table V - Statistics of EST clustering and contiging.

ESTs analyzed                                  237,954
Total clusters (C+S)                            43,141
Clusters with at least 2 reads (C)              26,303
Singletons (S)                                  16,838
C+S sequences finding homolog in GenBank        26,525
C+S sequences with no homolog in GenBank        16,616
C+S with full length insert                     14,409

ESTs were clustered using the CAP3 assembler (Huang and Madan, 1999).
The E-value cutoff threshold for a C or S to be considered as having
homology to other proteins in the nr GenBank database using BLASTX
was < 10^-5. Clones were considered as having a putative full-length insert
when their sequences started within the first 15 amino acids of their hit in
GenBank. C or S were considered as having a tentative full consensus
sequence when their sequences started within the first 15 amino acids and
finished within the last 15 amino acids of their hit in GenBank.



Table VI - EST clustering in the individual libraries.

Library  Number of  Unique    Number of   Specific
code     clusters   clusters  singletons  contribution (%)

AD1      10,736     3,120     2,821       3.84
AM1       7,870     1,930     1,726       2.37
AM2       9,079     2,389     2,012       2.94
CL6       4,282     1,231     1,112       1.51
FL1      11,438     3,740     3,468       4.60
FL3       7,847     2,178     1,997       2.68
FL4      10,145     2,626     2,407       3.23
FL5       6,489     1,697     1,589       2.08
FL8       3,963       811       780       0.99
HR1       6,697     1,664     1,434       2.04
LB1       4,697     1,149     1,074       1.41
LB2       7,056     1,749     1,597       2.15
LR1       8,867     2,250     2,104       2.77
LR2       2,901       696       662       0.85
LV1       4,005     1,037       950       1.27
RT1       5,706     1,435     1,336       1.76
RT2       7,851     2,081     1,875       2.56
RT3       5,699     1,398     1,251       1.72
RZ1       2,374       448       426       0.55
RZ2       4,054       939       869       1.15
RZ3       8,858     2,331     2,094       2.86
SB1      10,204     2,910     2,774       3.58
SD1       6,114     1,600     1,451       1.96
SD2       5,696     1,856     1,539       2.28
ST1       5,682     1,431     1,341       1.76
ST3       6,124     1,335     1,253       1.64


The number of clusters that contain one or more reads from a specific
library is indicated, as well as the clusters that were formed only by reads
of a specific library (Unique Clusters). The number of clusters that were
formed by only one read (Singletons) is also indicated. The specific
contribution is calculated by dividing the Unique Clusters of each library
by the total number of clusters (43,141).



serts (Table V). This is in accordance with the results ob- 
tained in the mouse EST project (Marra et al., 1999). 

This collection of 237,954 ESTs provides us with a 
preliminary view into the gene expression profile of sugar- 
cane. The identification of genes involved in different cel- 
lular processes suggests that the generation of large-scale 
ESTs should provide valuable insights into the molecular 
mechanisms of plant function and development. 
Trimming and clustering sugarcane ESTs
Abstract

The original clustering procedure adopted in the Sugarcane Expressed Sequence Tag project (SUCEST) had many problems, for
instance too many clusters and the presence of ribosomal sequences. We therefore redesigned the clustering procedure entirely,
including a much more careful initial trimming of the reads. In this paper the new trimming and clustering strategies are described in
detail and we give the new official figures for the project: 237,954 expressed sequence tags and 43,141 clusters.



INTRODUCTION 

The Sugarcane EST project (SUCEST) produced
291,689 expressed sequence tags (ESTs) (Adams et al.,
1991). In the pipeline of the project it was important to
cluster together sequences from the same transcript molecule
and to obtain a representative sequence for each group.
Clustering was important to evaluate the redundancy of the
set of ESTs during library production and sequencing, and
at the end of the project. Clustering also produces a smaller
set of sequences, which facilitates investigation of the data
by biologists and computer scientists (Telles et al., 2001).

As in any other EST project, the raw SUCEST sequences
sometimes contained unwanted segments such as
polyadenylation (poly-A), regions with low base quality,
fragments from vectors and adapters, and slippage. Some
reads may also have come from ribosomal RNA or contaminant
DNA. Such segments are unwanted because they introduce
similarity between ESTs that has no relevance for
clustering, and their removal is essential to cluster
correctly.

Trimming and clustering procedures were established
at the beginning of the SUCEST project in July 1999, but
the amount of data grew each day and it soon became clear
that neither procedure was good enough. SUCEST data users
were pointing out many problems by the time we designed
and implemented new trimming and clustering procedures.

A trimming procedure is essentially the task of searching
ESTs for unwanted regions, identifying them and
then deciding whether to remove the unwanted region or to
discard the entire EST. Trimming has already been
described for UniGene (www.ncbi.nlm.nih.gov/UniGene),
TIGR Gene Indices (Quackenbush et al., 2000) and
STACK (Miller et al., 1999).



In the SUCEST project, clustering was always performed
using a fragment assembler for the whole set of
ESTs. This is different from the procedure used by UniGene,
TIGR Gene Indices, JESAM (Parsons and Rodrigues-Tome,
2000) and STACK, which use some kind of
pairwise comparison to estimate the distance between ESTs,
build clusters and then, if at all, assemble the clusters
separately. In its first version, the SUCEST clustering
scheme produced 81,223 clusters (41,582 singletons), while
the current version has 43,141 clusters (16,838 singletons).

In this paper we describe trimming in detail, because
it had a major influence on the work performed by the
assembler at the clustering stage. We also compared the
results of different assemblers for our set of ESTs before we
decided in favor of the CAP3 program (Huang and Madan,
1999). Although we had confidence in the comparison of
fragment assemblers performed by Liang et al. (2000), three
issues motivated us to produce our own comparison routines.
Firstly, we wanted to examine the assembly results for our
particular set of ESTs; secondly, we were using EST quality
data; and, thirdly, we used parameters for the assemblers
that differ from the default ones. We also describe the
trimming and clustering procedures used early in the project.
Our intention in this paper is not to emphasize our
improved results but to show the remarkable effect that
'noise' (i.e. unwanted sequences) can have on clustering.

METHODOLOGY AND RESULTS 

Clone libraries were prepared as described by Vettore
et al. (2001) and sequenced on ABI 377 (Applied
Biosystems) machines. After being processed by the phred
base-calling program (version 0.980904.e, www.phrap.org)
and by the phd2fasta program (version 0.990622.d,
www.phrap.org), ESTs were stored as fasta and quality
files in the 5' to 3' orientation. These files contained
291,689 sequences with an average length of 864.5 ± 186.3


1 Bioinformatics Laboratory, Institute of Computing, UNICAMP, CP 6176, 13083-970 Campinas, SP, Brazil.
2 Center for Molecular Biology and Genetic Engineering, UNICAMP, CP 6010, 13083-970 Campinas, SP, Brazil.
Send correspondence to Guilherme P. Telles. E-mail: pimentel@ic.unicamp.br.






Telles and Silva



bases. The average number of bases with a phred quality
value greater than 20 per read was 399.5 ± 151.3. The
programs were run on an 8 GB RAM AlphaServer ES40
(Compaq) with 2 processors at 667 MHz running the
OSF1 operating system (version 4.0G).

Trimming 

An EST set may contain unwanted sequences made
up of poly-A fragments, vector and adapter fragments, low
quality ends, ribosomal RNA, contaminant DNA and
slipped sequences. When clustering the sequences to
produce groups of transcripts, these unwanted sequences
introduce irrelevant relationships between reads. Trimming is
the removal of such regions from ESTs or the removal of
entire ESTs from the set.

Trimming refined the reads in several steps, using the
blastall program (version 10/31/2000, www.ncbi.nlm.nih.gov)
that implements the BLAST algorithm (Altschul
et al., 1997), the cross-match program (version 0.990319,
www.phrap.org), the SWAT program (version 0.990319,
www.phrap.org) and ad hoc pattern-matching programs
written in Perl (version 5.6.0, www.cpan.org). Parsers
(programs that interpret data based on its syntactic
structure) for the output of these programs
were written in Perl, and bash (version 2.04.0(1),
www.gnu.org) scripts were used to filter, build histograms
and summarize data. Some regions, like poly-A, were
searched for several times, each time with a different
recognition criterion. Trimming was tuned to keep as much
as possible of each sequence.

The trimming scheme is summarized in Figure 1. The
first step was the removal of ribosomal RNA sequences,
and for this the ESTs were compared against 18S rRNA
from Zea mays (GenBank AF168884), 5.8S rRNA from
Platanus occidentalis (GenBank AF162215) and 26S
rRNA from Lambertia inermis (GenBank AF274652) using
the BLAST program. The choice of these rRNA sources
was based on the similarity between them and sugarcane
rRNA. A match with an e-value less than 10^-10 was the
threshold to discard a read, a total of 8,473 reads being
removed in this step.

The next step was vector and adapter sequence mask- 
ing, using the cross-match program that replaced bases with 
an X if they were very similar to vector and adapter se- 
quences used in the clone libraries. This was followed by 
removing the vector and adapter sequences themselves 
from the reads by deleting the regions marked with an X. 
The actual treatment given to these ambiguous regions de- 
pended on where the X-regions were found and how many 
there were, an X-region being a contiguous masked sub-se- 
quence in a read. 

Figure 1 - Overview of trimming procedure. White-headed arrows
indicate the number of reads discarded in each step, with the percentage
of total shown in parentheses.

Classes were devised based on the analysis of histograms
of the lengths of X-regions, the distance of the X-region
from the 5' and 3' ends and on the analysis of the number of
ESTs falling into each class. These classes were as follows:

Class 1. There were two distinct X-regions in the
read, this being what is to be expected as the result of
sequencing a clone with a small insert. In this case only the
sequence between the X-regions was kept.

Class 2. There were more than two X-regions in the
read, probably because of a low-quality vector. In this case
we did not change the read.

Class 3. There was only one X-region of no more than
300 bases that was less than 50 bases away from the 5' end.
This was the case when the region from the X-region down
to the 5' end probably consisted of vector sequences
extending from the sequencing priming site to the cloning
site. In this case we removed the X-region together with the
5' end.

Class 4. There was only one X-region with more than
300 bases that was less than 50 bases away from the 5' end.
In this case the clone probably had no insert so we
discarded the whole read.

Class 5. There was only one X-region of at most 300
bases that was 51 to 300 bases from the 5' end. In this case it
was hard to decide what the insert was so the read was not
changed.

Class 6. There was only one X-region with more than
300 bases that was 51 to 300 bases from the 5' end. This
probably occurred when the X-region and the 3' end
consisted of a vector sequence after the cloning site. In this
case we removed both the X-region and the 3' end.

Class 7. There was only one X-region of any length
and it was at least 300 bases away from the 5' end. In this
case we again removed both the X-region and the 3' end
because the deleted region probably consisted of a
post-cloning-site vector sequence.
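
The seven classes above amount to a small decision procedure. The
sketch below is one possible reading of it in Python, not the Perl
code used in the project. The function names are hypothetical, class 0
(no X-region at all) is our own label, and the handling of the
boundary distances (exactly 50 and exactly 300 bases, which the text
leaves open) is an assumption.

```python
import re


def x_regions(seq):
    """Return (start, end) spans of maximal runs of masked 'X' bases."""
    return [m.span() for m in re.finditer(r"X+", seq)]


def classify(seq):
    """Assign a masked read to one of the seven classes (0 = no X-region)."""
    regions = x_regions(seq)
    if not regions:
        return 0              # our own label; not defined in the text
    if len(regions) == 2:
        return 1
    if len(regions) > 2:
        return 2
    start, end = regions[0]
    length = end - start      # length of the single X-region
    dist5 = start             # bases between the 5' end and the X-region
    if dist5 < 50:
        return 3 if length <= 300 else 4
    if dist5 < 300:           # assumption: distance 50 falls in this group
        return 5 if length <= 300 else 6
    return 7                  # assumption: distance 300 or more


def trim_vector(seq):
    """Apply the action given for the read's class; None means discard."""
    cls, regions = classify(seq), x_regions(seq)
    if cls in (0, 2, 5):
        return seq                                # read left unchanged
    if cls == 1:
        return seq[regions[0][1]:regions[1][0]]   # keep the insert only
    if cls == 3:
        return seq[regions[0][1]:]                # drop 5' end and X-region
    if cls == 4:
        return None                               # probably no insert
    return seq[:regions[0][0]]                    # classes 6, 7: drop X-region and 3' end
```

For example, `trim_vector("X" * 20 + "A" * 100 + "X" * 30)` keeps only
the 100 bases between the two masked regions (class 1).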

While removing X-regions, any poly-A fragment
close to them was also removed. A poly-A fragment was
considered to be any region that scored at least 8 when
aligned with a probe sequence consisting only of As
(adenines). The scoring scheme added 1 for a match and -2
for a mismatch; gaps were given a high penalty (-8) because
they should not occur. The poly-A had to be at most 10
bases away from an X-region. Alignments were performed
using the SWAT program. Depending on the reading direction,
a poly-A can be read as poly-T, so a poly-T probe was used
as well. The removal of X-regions discarded 7,780 sequences.
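
The scoring scheme can be illustrated without SWAT. Because gaps are
penalized heavily enough to be effectively excluded, the best local
alignment against a long homopolymer probe reduces to finding the
maximum-scoring segment of the read with +1 per matching base and -2
per mismatch. The sketch below makes that simplification explicit; it
is our approximation, not the SWAT invocation used in the project, and
the function names are hypothetical.

```python
def best_poly_run(seq, base="A", match=1, mismatch=-2):
    """Return (score, start, end) of the best gapless run of `base`.

    Scores +match for each `base`, +mismatch otherwise, and restarts
    whenever the running score drops to zero (local alignment style).
    """
    best_score, best_start, best_end = 0, 0, 0
    score, start = 0, 0
    for i, ch in enumerate(seq.upper()):
        score += match if ch == base else mismatch
        if score <= 0:
            score, start = 0, i + 1          # restart after this position
        elif score > best_score:
            best_score, best_start, best_end = score, start, i + 1
    return best_score, best_start, best_end


def has_poly_a_or_t(seq, min_score=8):
    """True if a poly-A or poly-T run reaches the score threshold."""
    return any(best_poly_run(seq, base)[0] >= min_score for base in "AT")
```

The distance test (at most 10 bases from an X-region) would then be
applied to the returned start and end coordinates.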

The next step was quality-trimming, for which a
window of 20 bases was slid over every sequence in the set.
Starting at the 3' end, the window was slid one base at a
time, dropping the extreme base until 12 or fewer bases in
the window had a quality value below 10; the process was
then repeated for the 5' end. After quality-trimming,
X-regions no further than 10 bases from an end were removed.
Quality-trimming removed 1,708 sequences from the set.
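
A minimal sketch of the sliding-window rule follows, assuming the
phred qualities of a read are available as a list of integers in 5' to
3' order. The function name and the decision to stop once fewer than
`window` bases remain are our assumptions; the text does not say how
very short reads were handled at this stage.

```python
def quality_trim(quals, window=20, qmin=10, max_low=12):
    """Return (start, end) of the region kept after quality-trimming.

    From each end, bases are dropped one at a time while the terminal
    window still holds more than `max_low` bases with quality < qmin.
    """
    def low(lo, hi):
        return sum(1 for q in quals[lo:hi] if q < qmin)

    start, end = 0, len(quals)
    # 3' end first, as in the description above.
    while end - start >= window and low(end - window, end) > max_low:
        end -= 1
    # Then the 5' end.
    while end - start >= window and low(start, start + window) > max_low:
        start += 1
    return start, end
```

Varying `window`, `qmin` and `max_low` in this function corresponds to
the parameter scan used to choose the thresholds.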

The quality-trimming thresholds were chosen as follows.
A subset of 10,000 SUCEST sequences was randomly
selected on the basis of (i) high similarity (BLASTX
e-value below 10^-20) with protein sequences in the NCBI nr
database (www.ncbi.nlm.nih.gov), (ii) the length of the
matching nr sequence was enough to cover the EST and
(iii) the region of similarity did not extend to the end of the
EST. By using these criteria we had matches showing a
region of similarity that could, potentially, extend to the end
of an EST. Cases where the region of similarity did not
extend to the end of the EST may have been due to the low
quality of the EST sequence.

The exact point where the region of similarity ended,
the 'BLAST hit end' (BHE), was recorded for each EST in
the set and then the set went through the quality-trimming
procedure with varying values for the length of the window,
quality threshold and number of bases below threshold.
Obviously, high quality thresholds and low numbers of bases
produced shorter reads. The difference between the trimmed
position (TP) and the BHE (TP-BHE) was calculated
and averaged. The results for a 20-base quality window are
shown in Figure 2. The square in the figure indicates the
selected threshold values and shows that, on average, 43
bases after the BLAST hit end were kept.

The next step was slippage-trimming, slippage being
a sequencing artifact (Anon, 1998) which produces 'echoed'
bases in sequences, i.e. for one occurrence of a nucleotide
in the template the chromatogram shows several peaks
(q.v. Figure 3). Although bases sometimes appeared with
high 'background noise' (e.g. bases 215-230), generally the
intensity of the echoed peak was such that the base caller
incorrectly assigned a high quality value to the fake bases
(e.g. bases 175-205) and this prevented quality-trimming of
these artifacts.

A method to identify slipped reads based on the
sequence of the read was devised, this method being able to
find reads having many regions with repetitive bases (echoed
regions). The product of the lengths of echoed regions (with
at least 5 bases) was evaluated for each sequence. Echoed
regions larger than 10 bases contributed only 10 to the
product. Sequences with a product greater than 10^8 and echoed
regions covering more than 20% of their length were
discarded completely. This was the procedure adopted in most
cases when slippage was caused by a long poly-A sequence
at the 5' end of the read. But when a long poly-A at the 3'
end was increasing the product, only the poly-A (together
with the remaining 3' sequence) was discarded. The threshold
for poly-A identification in this situation was an alignment
with a score of at least 160. These thresholds were
determined by varying the parameters for echoed region
recognition, evaluating the products, and looking at several
chromatograms in many product ranges. Slippage-trimming
removed 15,621 reads.
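
The product criterion can be sketched as follows. We assume here that
an echoed region is a maximal run of one repeated base (a
homopolymer); the text does not define it more precisely, so this
interpretation, like the function names, is ours. The special handling
of long 3' poly-A tails is not included.

```python
from itertools import groupby


def echo_stats(seq, min_run=5, cap=10):
    """Return (product of capped run lengths, fraction of read covered).

    Only runs of at least `min_run` identical bases count, and each run
    contributes at most `cap` to the product.
    """
    product, covered = 1, 0
    for _base, group in groupby(seq.upper()):
        n = len(list(group))
        if n >= min_run:
            product *= min(n, cap)
            covered += n
    coverage = covered / len(seq) if seq else 0.0
    return product, coverage


def is_slipped(seq, min_product=10**8, min_coverage=0.20):
    """True if the read meets both criteria for being discarded."""
    product, coverage = echo_stats(seq)
    return product > min_product and coverage > min_coverage
```

With a cap of 10 per region, a read needs at least nine echoed regions
before its product can exceed 10^8.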

The next step in the trimming procedure was another
poly-A/T removal round, where poly-A/T scoring at 280
and over was removed from sequences. Smaller poly-A/T,
scoring at least 30 and less than 20 bases away from one of
the ends, was also removed. This step removed 2,006 reads.

Figure 2 - Distribution of the number of bases kept at the 3' end with a
quality window size of 20, with respect to the best BLAST hits against nr
(see text). The square and the bullet indicate the values used in the new and
old trimming procedures, respectively. [Axis label: bases below threshold.]

Figure 3 - Consed (www.phrap.org) trace window of a slipped read. The
background of the base letters indicates their phred quality. Darker colors
correspond to lower qualities. The numbers above the letters show their
position in the read.

The final step was to remove any read with fewer than
100 bases or with fewer than 50 bases having phred quality
greater than or equal to 20. A total of 18,147 reads fell
into this category.
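
This last filter is simple enough to state directly in code. The sketch
assumes a read is given as its base string together with a list of
phred qualities; the function name is hypothetical.

```python
def is_valid_read(seq, quals, min_length=100, min_good=50, q_good=20):
    """Keep a read only if it is long enough and has enough good bases."""
    good_bases = sum(1 for q in quals if q >= q_good)
    return len(seq) >= min_length and good_bases >= min_good
```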

At the end of all the steps described above, 237,954
reads were left with an average length of 641.6 ± 139.8
bases (152.5 Mbp in total). The average number of bases
with a phred quality greater than or equal to 20 per read
was 397.8 ± 120.1.

In contrast, the trimming method formerly used in the
SUCEST project was simpler. That method started with
only one round of very restricted poly-A removal, searching
for 12 or more consecutive As adjacent to the vector.
The final step was quality-trimming using the same scheme
as above with a window length of 20, quality equal to 15
and the number of bases equal to 8. For the reads used in the
quality window experiment this combination of thresholds
discarded 137 bases from the reads on average (relative to
the BHE) as shown in Figure 2. This method applied to the
original set of SUCEST reads resulted in 261,609 reads
with an average length of 512.1 ± 114.8 bases. The average
number of bases with a phred quality greater than or equal
to 20 per read was 392.4 ± 128.3.

BLAST was used to compare the ESTs from the original
set of reads in the SUCEST database with the genomes
of Xylella fastidiosa, Xanthomonas citri, Escherichia coli
and other potential laboratory contaminants that could have
been present in the libraries. A match of at least 100 bases
and more than 90% identity resulted in the read being
marked as probably resulting from contamination. A total
of 114 ESTs were thus marked. Because there were so few
matches, and because of the difficulty of deciding whether
or not marked ESTs really were the result of contamination,
these ESTs were not removed by either of the trimming
procedures.
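
The marking rule can be sketched as a filter over BLAST hits. We
assume, for illustration only, that the hits are available as
tab-delimited lines with the query name in the first column, percent
identity in the third and alignment length in the fourth (the layout of
the standard 12-column tabular BLAST report). How the project actually
parsed its BLAST output is not stated, and the function name is
hypothetical.

```python
def contaminated_reads(tabular_lines, min_length=100, min_identity=90.0):
    """Return the set of query names with a hit meeting both thresholds."""
    flagged = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        identity = float(fields[2])       # percent identity
        length = int(fields[3])           # alignment length in bases
        if length >= min_length and identity > min_identity:
            flagged.add(query)
    return flagged
```

Reads returned by such a filter would only be marked, not discarded,
in keeping with the decision described above.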

Clustering 

For the SUCEST project it was necessary to estimate 
the redundancy of the clone libraries as they were se- 
quenced, which could be achieved by joining similar tran- 
scripts into clusters. Clustering results allowed project 
coordinators to decide when to stop sequencing any partic- 
ular library. 

Fragment assemblers were used for clustering. A 
fragment assembler is a program that takes a set of reads 
and their qualities as input, builds groups based on the over- 
laps of reads and creates a consensus sequence for the reads 
in each group. 

Reads processed by the old trimmer were assembled,
with quality data, using the phrap program (version
0.990319, www.phrap.org) with the arguments set to
predetermined values (penalty -15, bandwidth 14, minscore
100, shattergreedy) that made it more stringent. This
assembly, called 'old-trim', produced 81,223 clusters
(41,582 singletons).

To cluster the reads trimmed by the new procedure,
three different assemblies were performed and compared.
Phrap was used with two sets of arguments, the default
arguments (phrap-d assembly) and the more stringent
arguments listed above (phrap-hs assembly). The CAP3
program was used with its default arguments. Quality data
was used for every assembly. Table 1 shows the cluster size
(number of ESTs in a cluster) distribution for the
assemblies, as well as the number of equal clusters between
them. Equal clusters are those with the same reads.

Table 1 - Cluster size distribution for the CAP3, phrap-d and phrap-hs
assemblies of reads processed by the new trimming procedure. The X columns
indicate the number of equal clusters between two assemblies, while the
'common' column shows the number of clusters equal in the three assemblies.
The numbers of clusters obtained with the original trimming procedure are
shown in the 'Old-trim' column. Cluster sizes represent the number of
expressed sequence tags (ESTs) in a cluster.

[Table body: row and column alignment was lost; the surviving column
fragments are listed below, one fragment per line.]

phrap-hs X phrap-d
13731 5617 2402 1239 676
32202 12440 6752 4225 2856 2098 1582 1245 974 776 639 492 437 366 306 273 225 177 124 143 113 105 92 80 69 56 51 44 439 69381
442 288 202 156 105 76 71 47 42 31 25 15 11 6 10 6 3 4 4 3 2 2 1 5
18535 9207 5192 3329 2360 1806 1362 1091 913 752 607 547 454 391 390 279 273 227 177 149 130 130 100 99 109 108 59 73 857
99 99
6
295
1227
1984
113
26 18 18 11 5 5
2 2 3 2 5 1 1 1 0
43141
10744 3792 1441 697 344 231 144 99 72 44 30 32 25 13 11 8 4 2 3 3 0 1
2 1 2 1 1 1
0
7421 4482 3110
1582 1219
153
106

Figure 4 - Distribution of discrepant reads among the assemblies. As
discrepant reads can only be calculated for clusters of two or more reads the
number of reads belonging to such clusters in each assembly is shown in
parentheses in the legend.

Two tests were performed for the assemblies. The
first verified 'internal consistency' by checking every
cluster with two or more reads for discrepant reads. To be
discrepant, a read base must both disagree with the
consensus base and have less than a 2% probability of being
miscalled by the phred program. An x% discrepant read is a
read with at least x% discrepant bases. Figure 4 shows the
proportion of x% discrepant reads in each assembly, for
values of x varying from 30 to 90 in steps of 10.
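
The definition of a discrepant read can be made concrete with a short
sketch. A phred quality Q corresponds to an error probability of
10^(-Q/10), so "less than a 2% probability of being miscalled" means
10^(-Q/10) < 0.02. The code assumes the read has already been aligned
to the consensus base by base with no gaps, which is a simplification
of a real assembly; the function names are ours.

```python
def error_probability(q):
    """Phred error probability for quality value q."""
    return 10 ** (-q / 10)


def discrepant_fraction(read, quals, consensus, max_error=0.02):
    """Fraction of read bases that are discrepant with the consensus.

    `read`, `quals` and `consensus` must have the same length and be
    aligned position by position (assumption: gapless alignment).
    """
    discrepant = sum(
        1
        for base, q, cons in zip(read, quals, consensus)
        if base != cons and error_probability(q) < max_error
    )
    return discrepant / len(read)


def is_x_percent_discrepant(read, quals, consensus, x):
    """True if at least x% of the read bases are discrepant."""
    return 100 * discrepant_fraction(read, quals, consensus) >= x
```

Low-quality mismatches are deliberately ignored, since they are more
likely to be base-calling errors than evidence of a misplaced read.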

The second test verified the 'external consistency' of
the assemblies by comparing the consensus sequences
produced by a given assembly to each other using BLAST.
Percentage identity was evaluated for end-overlaps of 200
or more bases found between two clusters, and Figure 5
shows a plot of the percentage of clusters having an
identity of more than 75% with other clusters in a given
assembly with respect to the total of possible overlaps
within that set of clusters.

Figure 5 - Plot of external consistency test results. For a given assembly
with n clusters the number of overlaps detected was divided by n(n-1)/2,
which is the maximum number of possible overlaps for n clusters.
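
The normalization used in Figure 5 is a simple pair count. The sketch
below shows it for a hypothetical number of detected overlaps; only the
n(n-1)/2 formula comes from the caption, and the function names are
ours.

```python
def max_pairs(n_clusters):
    """Maximum number of pairwise overlaps among n clusters: n(n-1)/2."""
    return n_clusters * (n_clusters - 1) // 2


def overlap_fraction(n_overlaps, n_clusters):
    """Detected overlaps as a fraction of all possible cluster pairs."""
    return n_overlaps / max_pairs(n_clusters)


# For the 43,141 CAP3 clusters the denominator is very large, so the
# plotted fractions are expected to be small numbers.
print(max_pairs(43141))
```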



DISCUSSION 

The trimming procedure described in this paper
discarded 53,735 SUCEST reads, 18.4% of the total. In spite
of this large number, it is worth noting that 16% of the
discarded sequences were ribosomal RNA and 34% were
smaller than 100 bases. We cannot exclude the possibility
that we have discarded useful reads with this procedure,
but we tried to avoid this as much as possible. It is also
obvious that not every artifact has been removed. For
example, counting the reads in the output of the trimming
procedure that have a sub-sequence of at least 30
consecutive adenines found 711 reads. Moreover, trimming is
not a light computational task, taking 8.3 h to process all
the SUCEST reads.
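
The figures quoted in this paragraph can be cross-checked against the
per-step counts given in the trimming section. The sketch below simply
adds them up; the step labels are our shorthand for the steps described
in the text.

```python
total_reads = 291_689

# Reads discarded at each trimming step, as reported in the text.
discarded = {
    "ribosomal RNA": 8_473,
    "vector/adapter (X-regions)": 7_780,
    "quality-trimming": 1_708,
    "slippage-trimming": 15_621,
    "poly-A/T removal": 2_006,
    "short or low-quality reads": 18_147,
}

total_discarded = sum(discarded.values())
remaining = total_reads - total_discarded

for step, n in discarded.items():
    share = 100 * n / total_discarded
    print(f"{step:28s} {n:7d}  {share:5.1f}% of discarded reads")
print(f"discarded: {total_discarded} "
      f"({100 * total_discarded / total_reads:.1f}% of all reads)")
print(f"remaining: {remaining}")
```

The per-step counts add up to the 53,735 discarded reads and leave the
237,954 reads reported for the final set.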

Nevertheless, the influence of the quality of trimming
on the final clustering is remarkable. For instance, it is hard
to accept that all the singletons in the old assembly
represent uniquely expressed sugarcane genes, and 81,223 was
an unreasonably large number of clusters. Good trimming also
shortened the CPU time required for clustering: the phrap
program took 9.2 h to build the phrap-d assembly and 6.5 h
for the phrap-hs assembly, while the CAP3 program took
77.1 h. To assemble the old set of trimmed reads, phrap
took 5 times longer than it spent to produce phrap-hs,
while CAP3 ended abnormally when fed with that data set.

We have used a fragment assembler for the whole set 
of ESTs in the SUCEST database and, consequently, the bi- 
ological definition of 'one cluster, one gene' cannot be 
used. A SUCEST cluster can be better defined as 'a set of 
very similar transcripts'. 

Building consensus sequences for clusters is useful in
several respects. Firstly, electing a representative sequence
for each cluster results in a smaller set of sequences to work
with. Secondly, the portions of representative sequences
covered by more than one read are more accurate than the
reads themselves. Thirdly, representative sequences may
be longer than individual reads, increasing their usefulness.
This third point was confirmed by the fact that 33% of
representative sequences with homologous genes in other
organisms were actually full-length sequences (Vettore et al.,
2001).

However, chimeras may result from assembling 
ESTs and a further problem is that using a fragment assem- 
bler for clustering will put alternatively spliced forms of 
genes into different clusters. But in a dodecaploid organism 
like sugarcane it is especially difficult to distinguish alleles 
of genes from very conserved multigene families based on 
similarity. 

The assembly produced by the CAP3 program was
taken as the 'official' clustering for the SUCEST project.
This decision was based on the result of the internal and
external consistency tests, where the CAP3 assembly
outperformed both the phrap-hs and phrap-d assemblies.
Internal consistency shows that the CAP3 assembly has a
lower incidence of discrepant reads in clusters when compared
to the other assemblies. External consistency reveals that the
CAP3 program produces fewer redundant clusters, i.e. two
or more clusters that probably should be condensed into a
single cluster. Unfortunately, we performed no comparisons
of our results with those that would be produced using some
other method described in the literature. This is an
interesting investigation to perform in the future.

The trimming and clustering procedures described in 
this paper hide a large amount of computational time and 
human work spent looking at the data, testing insights, ad- 
justing parameters, and designing the pipeline. There are no 
'magic numbers'. We believe that these guidelines may be 
used in some other EST projects, although using these pro- 
cedures with different data sets may require some adjust- 
ments. The need for many cycles of adjustment and testing 
is a natural consequence of the nature of the noise present in 
ESTs, the limitations posed by technological issues and the 
lack of a complete understanding of the biological pro- 
cesses occurring within cells. 

RESUMO 

The clustering method adopted in the SUCEST project
(Sugarcane EST Project) had several problems (too many
clusters, the presence of ribosomal sequences, etc.). We
took on the task of redesigning the entire clustering
process, proposing a more careful initial "cleaning" of the
sequences. In this article the sequence cleaning and
clustering strategies are described in detail, including
the official figures for the project (237,954 ESTs and
43,141 clusters).